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Abstract 

Direct-summation iV-body algorithms compute the gravitational interaction be- 
tween stars in an exact way and have a computational complexity of 0(N 2 ). Per- 
formance can be greatly enhanced via the use of special-purpose accelerator boards 
like the GRAPE-6A. However the memory of the GRAPE boards is limited. Here, 
we present a performance analysis of direct iV-body codes on two parallel super- 
computers that incorporate special-purpose boards, allowing as many as four million 
particles to be integrated. Both computers employ high-speed, Infiniband intercon- 
nects to minimize communication overhead, which can otherwise become significant 
due to the small number of "active" particles at each time step. We find that the 
computation time scales well with processor number; for 2 x 10 6 particles, efficiencies 
greater than 50% and speeds in excess of 2TFlops are reached. 
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1 Introduction 



Num erical algorithms for solving the gravitational iV-body problem (JAarseth 
2003[ ) have evolved along two basic lines in recent years. Direct-summation 
codes compute the complete set of iV 2 interparticle forces at each time step; 
these codes are designed for systems in which the finite- N graininess of the 
potential is important or in which binary- or multiple-star systems form, and 
until recently, were limited by their 0(N 2 ) scaling to moderate (N < 10 5 ) 
particle numb ers. The best-k nown examples are the NBODY series of codes 
introduced by Aarseth ( 1999j ) and th e Starlab environment deve loped by 
McMillan, Hut, and collaborators (e.g. iPortegies Zwart et al.1 (|200lh ). 



A second class of A^-body algorithms replace the direct summation of forces 
from distant par ticles by an approxima tion scheme. Examples are the Barnes- 
Hut tree code ( Barnes and Hut 1986), which reduces the number of force 
calculations by subdividing particles into an oct-tree, and fast multipole al- 
gorithms which represent the large-scale potential via a truncated basis-se t 
expansion (Ivan A lbada an d van Gorkoidll977i iGre engard and Rokhlin 1987), 
or on a grid ((Miller and PrendergastJ 19681 : Efstathiou and Eastwood! 1981 ) . 
These algorithms have a milder, O(NlogN) or even O(N) scaling for the 
force calculations and can handle much larger particle numbers, although their 
accuracy can b e substantially lower than that of the direct-summation codes 
( Spurzem 1999b . The efficiency of both sorts of algorithm can be considerably 
i ncreased by th e use of individual time steps for advancing particle positions 
( Aarsethll2003h . 



The A^-body problem is particularly challenging in the case of dense stellar sys- 
tems like galactic nuclei, whic h may contain single or multiple, supermassive 
black holes in addition to stars ( Ferrarese and Fordl2005| : Merritt and Milosavlievic 
20051 ). Particle advancement must be very accurate for trajectories that pass 
near the black hole(s). In addition, galactic nuclei are often "collisional" in 
the sense that gravitational encounters (i.e. scattering) can redistri bute en 



ergy between stars on time scales less than the age of the universe ((Merritt 



2006). Simulating the evolution of a collisional nucleus is probably not fea- 
sible using tree or grid algorithms due to their limited accuracies. While al- 
ternative, highly-efficient algorithms based on the Fokker - Planck or fluid for- 
malisms have been widely us ed to model galactic nuclei ((Louis and Spurzem 
1991 : Freitag and Benzl 2001 ). these methods can not deal with systems that 
are far from dynamical equilibrium or that fail to respect spatial symmetries. 



A natural way to increase both the speed and particle number in an A '-body 
simulation is to parallelize ( Dubinskil 19961 : Pearce and Couchman 1997 ) . Par- 
allelization on general-purpose supercomputers is difficult, however, because 
the calculation cost is often dominated by a small number of particles in a sin- 
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gle dense region, e.g. the nucleus of a galaxy. Communication latency becomes 
the bottleneck: the time to communicate particle positions between processors 
can exceed the time spent computing the forces. The best such schemes use 
systolic algorithms (in which the particles are rhythmically passed around 
a ring of processors) coupled wit h non-blocking communication bet ween the 
processors to reduce the latency (|Makinoll20ni borband et al.ll200ah . 



A major breakthrough in direct-summation iV-body simulations came in the 
late 199 0s with the development of the GRAPE series of special-purpose com- 
puters ( Making and Taiii Il998l ). which achieve spectacular speedups by im- 



plementing the entire force calculation in hardware and placing many force 
pipelines on a single chip. The GRAPE-6, in its standard implementation (32 
chips, 192 pipelines), can achieve sustained speeds of about 1 Tflops at a cost 
of just ~ $50K. In a standard setup, the GRAPE-6 is attached to a single 
host workstation, in much the same way that a floating-point or graphics ac- 
celerator card is used. Advancement of particle positions [O(N)] is carried 
out on the host computer, while interparticle forces [(9(A 2 )] are computed on 
the G RAPE. More recently, "mini-GRAPEs" (GRAPE-6A) (|Fukushige et al " 



2005f ) have become available, which are designed to be incorporated into the 



nodes of a parallel computer. The mini-GRAPEs place four processor chips 
on a single PCI card and delivering a theoretical peak performance of ~ 131 
Gflops for systems of up to 128k particles, at a cost of ~ $6K. By incorpo- 
rating mini-GRAPEs into a cluster, both large (> 10 6 ) particle numbers and 
high (> 1 TFlops) speeds can in principle be achieved. 

In this paper, we describe the performance of direct-summation iV-body al- 
gorithms on two computer clusters that incorporate GRAPE hardware. §2 
describes the hardware implementations. The parallel TV-body code and its 
implementation on the GRAPEs is described in §3. §4 presents the results of 
performance tests using realistic galaxy models, and §5 describes a theoretical 
performance model that reproduces the observed performance and which can 
be used to predict the performance of similar codes on different clusters. §6 
summarizes our results and discusses directions for future work. 



2 Hardware 



2.1 The GRAPE technology 



The GRAPE-6A board (Fig. 1) is a standard PCI short card on which a pro- 
cessor, an interface unit, and a power supply are integrated. The processor 
is a module consisting of four GRAPE-6 processor chips, eight SSRAM chips 
and one FPGA chip. The processor chips each contain six force calculation 
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pipelin es, a predictor pipe line, a memory interface, a control unit, and I/O 
ports (|Makino et alJbOOaj) . The SSRAM chips store the particle data. The 
four GRAPE chips can calculate forces, their time derivatives and the scalar 
gravitational potential simultaneously on a maximum of 48 particles at a time; 
this limit is set by the number of pipelines (six force calculation pipelines each 
of which serves as eight virtual multiple pipelines). There is also a facility to 
calculate neighbor lists from predefined neighbor search radii; this feature is 
not used in the algorithms presented below. The forces computed by the pro- 
cessor chips are summed in an FPGA chip and sent to the host computer. A 
maximum of 131 072 (2 17 ) particles can be stored in the GRAPE-6A memory. 
The peak speed of the GRAPE-6A is 131.3 GFlops (when computing forces 
and their derivatives) and 87.5 GFlops (forces only), assumi ng 57 floating- 
point operations per force calculation ((Fukushige et al.ll2005f ). The interface 
to the host computer is via a standard 32 bit/33 MHz PCI bus. The FPGA chip 
(Altera EP1K100FC256) realizes a 4-input, 1-output reduction when transfer- 
ring data from the GRAPE-6 processor chip to host computer. The complete 
GRAPE-6A unit (Fig. 1) is roughly 11 cm x 19 cm x 7 cm in size. 5.8 cm of 
the height are taken up by a rather bulky combination of cooling body and 
fan, which may block other slots on the main board. Possible ways to deal with 
this include the use of even taller boxes for the nodes (e.g. 5U) together with a 
PCI riser of up to 6 cm, which would allow the use of slots for interface cards 
beneath the GRAPE fan; or the adoption of the more recent, flatter designs 
such as that of the GRAPE6-BL series. The reader interested in more tech- 
nical details should seek advice from the GRAPE (http://astrogrape.org) 
and Hamamatsu Metrix (http : /www.metrix . co . jp) websites. 



2.2 The GRAPE cluster 



A computer cluster incorporating GRAPE-6A boards became fully operational 
at the Rochester Institute of Technology (RIT) in February 2005 (Fig. 2). This 
cluster, named "gravitySimulator," consists of 32 compute nodes plus one head 
node each containing dual 3 GHz-Xeon processors. In addition to a standard 
Gbit-ethernet, the nodes are connected via a low-latency Infiniband network 
with a transfer rate of lOGbits -1 . The typical latency for an Infiniband net- 
work is of the or der of 10 -6 s, or a factor ~ 100 better then the Gbit-Ethernet 
( Liu et al. 2003). A total of 14TByte of disk space is available on a level 5 
RAID array. The disk space is equivalent to 2.5 x 10 5 A^-body data sets each 
with 10 6 particles. The disks are accessed via a fast Ultra320 SCSI host adapter 
from the head node or via NFS from the compute nodes, which in addition 
are each fitted with a 80 Gbyte hard disk. Each compute node also contains 
a GRAPE-6A PCI card (Fig. 1). The total, theoretical peak performance is 
approximately 4TFlops if the GRAPE boards are fully utilized. Total cost 
was roughly $0.45M, roughly 1/2 of which was used to purchase the GRAPE 
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Fig. 1. Interior of a node showing a GRAPE-6A card (note the large black fan) and 
an Infiniband card. 

boards. 

Some special considerations were required in order to incorporate the GRAPE 
cards into the cluster. Since our GRAPE-6A's use the relatively old PCI inter- 
face standard (32 bit/33 MHz), only one motherboard was found, the SuperMi- 
cro X5DPL-iGM, that could accept both the GRAPE-6A and the Infiniband 
card. (A newer version of the GRAPE-6A that uses the faster PCI-X technol- 
ogy is now available.) The PC case itself has to be tall enough (4U) to accept 
the GRAPE-6A card and must also allow good air flow for cooling since the 
GRAPE card is a substantial heat source. The cluster has a total power con- 
sumption of 17 kW when the GRAPEs are fully loaded. Cluster cooling was 
achieved at minimal cost by redirecting the air conditioning from a large room 
toward the air-intake side of the cluster. Temperatures measured in the PC 
case and at the two CPUs remain below 30 C and 50 C, respectively. 

A similar cluster, called "GRACE" (GRAPE + MPRACE), was recently in- 
stalled in the Astronomisches Rechen-institut (ARI) at the University of Hei- 
delberg. There are two major differences between the RIT and ARI clusters. 1) 
Each node of the ARI cluster incorporates a reconfigurable FPGA card (called 
"MPRACE") in addition to to the GRAPE board. MPRACE is optimized to 
compute neighbor forces and other, non-Newtonian forces between particles, 
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Fig. 2. gravitySimulator, the GRAPE cluster at RIT. The head node and the 
14 Tbyte raid array are visible on the first rack. The other four racks hold a to- 
tal of 32 compute nodes, each equipped with a GRAPE-6A card. 

in order to accelerate calculations of molecular dynamics, smoothed-particle 
hydrodynamics, etc. 2) The newer main board SuperMicro X6DAE-G2 was 
used, which supports Pentium Xeon chips with 64 bit technology (EM64T) 
and the PCIe (PCI express) bus. This made it possible to use dual-port Infini- 
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band interconnects via the PCI express Infiniband x8 host interface card, used 
in the xl6 Infiniband slot of the board (it has another x4 Infiniband slot which 
is reserved for the MPRACE-2 Infiniband card). As discussed below, the use 
of the PCIe bus substantially reduces communication overhead. Benchmark 
results presented below for the ARI cluster were obtained from algorithms 
that do not access the FPGA cards. 



3 The parallel A^-body code 



We present here a new direct-summation code, called yjGRAPE, which was op- 
timized for perfoma nce on GRAPE clusters. Th e algorithm employs a Hermite 
integration scheme ( Makino and Aarseth 1992| ) with hierarchical, commensu- 
rate block time steps. Although comp arable in comp lexity to - and inspired 
by - Aarseth's serial NBODYl code (|Aarsethl fl999l) , 93GRAPE was written 
"from scratch" for the RIT GRAPE cluster. 1 



3.1 Integration scheme 



In addition to position Xj, velocity v$, acceleration aj, and time derivative of 
acceleration a,;, each particle i has its own time tj and time step Atj. 

Integration consists of the following steps: 

(1) The initial time steps are calculated from 

Ati = Vs^\, (1) 

where typically r] s = 0.01 gives sufficient accuracy. 

(2) The system time t is set to the minimum of all U + Ati and all particles % 
that have tj + At i = t are select e d as a ctive particles. Note that "classical" 
N-body codes (lAarsethl Il99fll l2003h employ a sorted time-step list to 
select short time step particles efficiently. Such sorting is abandoned in 
favor of a search over all N particles, which improves load balance in the 
parallel code due to the more random distribution of short and long step 
particles. 

(3) Positions and velocities at the new t are predicted for all particles using 



x 



(/ tj) 2 (t - tj) 3 . 
- x ; . () + {t — tj)v jfi H a,. n H a j)0 (2) 



1 The code will be publicly available; see http://grapecluster.rit.edu/ 
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and 



v j,p = v j,o + (t - tjfefl + a,- . 



(3) 



Here, the second subscript denotes a value given either at the beginning 
(0) or the end (1) of the current time step. All quantities used in the 
predictor can be calculated directly, i.e. no memory of a previous time 
step is required. 

(4) Acceleration and its time derivative are updated for active particles only 
according to 



j^i V ij ~ c J 



(4) 



and 



<l 3 



+ 



( r 2 +e 2)(3/2) ( r 2 +e 2)(5/2) 



(5) 



where 



and e is the softening parameter. 
(5) Positions and velocities of active particles are corrected using 



(6) 
(7) 



Atf (2) (3) 



(8) 



and 



v u = v„ P + M aS + M a <3» i 



(9) 



where the second and third time derivatives of a are given by 



(2) _ -6 (a^p - a^i) - Atj (4a^ + 2a M ) 



(3 ) _ 12 (a ii0 - 3Li,i) + 6AU (k ifi + a^i) 



H,0 



At? 



(10) 
(11) 
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(6) The times U are updated and the new time steps At i are determin ed. 
Time steps are calculated using the standard formula (Aarseth 1985f ): 



\ 



l ll ( 2 ) l 

l a «,l I l a i,l I 



a i,l 



I • II (3) I 
l a 'M I l a i 1 I 



+ i a Si 2 ' 



(12) 



The parameter 77 controls the accuracy of the integration and is typically 

(2) 

set to 0.02. The value of a) { is calculated from 



„ (2) _ _(2) , Ay. „(3) 



(13) 



and api is se t to a^ 



.0 • 



(7) Repeat from step (2). 



A hierarchical commensurate block time step scheme is necessary when the 
Hermite integrator is used with the GRAP E (and is a l so eff icient for paral- 
lelization and vectorization; see below and iMcMillanl l|l98fih ). Particles are 
grouped by replacing their time steps At, with a block time step At^b = 
(1/2)™, where n is chosen according to 



< AU < 



n—l 



(14) 



The commensurability is enforced by requiring that t/ Ati be an integer. For 
numerical reason we also set a minimum time step At min , where typically 



At mhi = 2- 23 . (15) 



The time steps of particles with Ati < At min are set to this value. The mini- 
mum time step should be consistent with the maximum acceleration defined by 
the softening parameter; monitoring of the total energy can generally indicate 
whether this condition is being violated. 



3.2 GRAPE implementation 



The GRAPE-6 and GRAPE-6A hardware has been designed to work with 
a Hermite integration scheme and is therefor e easily integrated i nto the al- 
gorithm described in the previous section (see iMakino et al~ll2003h . In detail, 
integration of particle positions using the GRAPE-6A consists of the following 
steps: 
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(1) Initialize the GRAPE and send particle data (positions, velocities, etc.) 
to GRAPE memory. 

(2) Compute the next system time t and select active particles on the host 
(same as step 2 in previous section). 

(3) Predict postions and velocities of active particles only and send the pre- 
dicted values together with the new system time t to GRAPE's force 
calculation pipeline. 

(4) Predict positions and velocities for all other particles on the GRAPE, 
and calculate forces and their time derivatives for active particles. 

(5) Retrieve forces and their time derivatives from the GRAPE and correct 
postions and velocities of active particles on the host. 

(6) Compute the new time steps and update the particle data on the host 
of all active particles in the GRAPE memeory. 

(7) Repeat from step (2). 



3. 3 Parallelization 

Two basic schemes have been used to implement parallel force computations 
for 0(N 2 ) problems on general-purpose computers. The simplest case to con- 
sider is when all particles have the same fixed time step. 

Replicated data algorithms (also called "copy" or "broadcast" algorithms). 
Each compute node has a copy of the whole system but is assigned a spe- 
cific subset of N/p particles, where p is the number of processors. At every 
step, each node computes the forces exerted by all iV particles on its subset. 
These particles are then advanced and their updated positions and velocities 
are sent to the other processors. 

Systolic algorithms (also called "ring" algorithms). At the start of the inte- 
gration, each node is permanently assigned a subset of N/p particles. At each 
step, these sub-arrays are shifted sequentially to the other nodes where the 
partial forces are computed and stored. After p — 1 such shifts, all of the force 
pairs have been computed and the particles are returned to their original nodes 
where their trajectories can be advanced. 

Both the replicated and systolic algorithms exhibit an 0(N log p) scaling in 
communication complexity and an 0(N 2 ) scaling in number of force compu- 
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tations. The systolic algorithm makes more efficient use of m emory; however 



memory limitations are typi cally not restrictive for N < 10 6 ([Dorband et al 
20031 lOualandris et al.ll2005h . 



The performance of parallel algorithms can be substantially degraded however 
if the iV-body system has a "core-halo" structure, i.e. a dense central region 
surrounded by a low-density envelope. A galaxy containing a central black hole 
is an extreme example. Individual time steps (Equation 12) are mandated 
in this case, and the group size - the number of active particles due to be 
advanced at every time step - can be much smaller than N; indeed it can often 
be smaller than p. In the latter case, the systolic algorithm suffers since only a 
fraction of the nodes are active at a given time. Nonblocking communication 
is an effective way to deal with this problem since it allows communication to 
be put "in the background" so that computing node s can send/receive data 
and calculate at the same time (|Dorband et al.ll2003h . 



Adding the GRAPE hardware imposes additional constraints. The GRAPE 
memory holds only Nq ~ 10 5 particles, hence the copy algorithm becomes 
inefficient for large N. The systolic algorithm is a natural alternative, allowing 
a total of Nq x p particles to be stored in the collective GRAPE memories. But 
a problem arises with the systolic algorithm if the number of active particles 
on any node is less than 48, since the time required by the GRAPE to compute 
forces on one particle is the same as the time to compute the forces on 48. 

Our solution was to adopt a hybrid scheme. Nodes are initially assigned a 
subset of N/p particles as in the systolic algorithm, where N/p < Nq, ensuring 
that all N/p particles can be stored in the GRAPE memory. However, once 
the active particles on each node have been identified, they are broadcast to 
all the other nodes, thus minimizing the possibility that any one GRAPE will 
be required to compute forces on less than 48 particles. 2 

In detail, our parallel algorithm works as follows: 

(1) Distribute the particle data to all nodes such that each node receives 
N/p particles. 

(2) Initialize the GRAPE card on each node and send the local particle data 
to the GRAPE memories. 

(3) Compute the minimum time step on each node and use allreduce to 
find the global minimum. 

(4) Select the active particles on each node and predict their positions and 
velocities. 



Fukushige et al.l (|2005r ) have adopted a similar scheme. 
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(5) Collect the particle data (including the predicted values) of all active 
particles onto all nodes using allgather. 

(6) Compute the partial forces on each node for the global set of active 
particles using the GRAPE. 

(7) Retrieve the local partial forces, which are summed to get the total 
forces using allreduce. 

(8) Correct positions and velocities for the local active particles on each 
node and update the GRAPE memory. 

(9) Repeat from step (3). 



4 Performance tests 



We evaluated the performance of this algorithm on the two GRAPE clusters 
described above. 



4-1 N-body models 



Initial conditions for the performance tests were produced by generating Monte- 
Carlo positions and velocities from self-consistent models of stellar systems. 
Each of these models is spherical and is completely described by a steady-state 
phase-space distribution function f(E) and its self-consistent potential ^(r), 
where E = v 2 /2 + ^ is th e particle energy and r is t he distance f rom the 
center. The models were: a Plummerl (Il911) sphere; tw o King ( 19661 ) models 
with different concentrations; and two Dehnen ( 19931 ) models with different 
central density slopes. The Plummer model has a low central concentration 
and a finite central density; it does not accurately represent any class of stel- 
lar system but is a common test case. King models are defined by a single 
dimensionless parameter Wq describing the central concentration (e.g. ratio of 
central to mean density); we used Wn = 9 and Wo = 12 which are appropriate 
for globular star clusters ((Spitzer 1987 ). Dehnen models have a divergent in- 
ner density profile, p oc r~ 7 . We took 7 = 0.5 and 7 = 1.5, which correspond 
approximately t o the inner density pro files of bright and faint elliptical galax- 



ies respectively (IGebhardt et al 



Milky Way galaxy has p ~ r -15 (|Genzel et aTll2003l) 



996); in particular , the central bulge of the 



In what follows we adopt standard iV-body units G = M = —4E = 1, where 
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Fig. 3. Energy conservation in each of the five test models, for four TV-values. Rel- 
ative errors have been shifted by multiples of 1CT 4 for each model. From top to 
bottom, each group of four lines represents Plummer models; Dehnen modes with 
7 = 0.5 and 7 = 1.5; and King models with Wo = 9 and Wo = 12. The results 
for N = 4K (full line), 64 K (dotted), 256 K (dashed) and 512 K (dash-dotted) are 
shown in each group. 

G is the gravitational constant, M the total mass and E the total energy of 
the system. In some of the models, the initial time step for some particles was 
smaller than the minimum time step t min set in Eq. 15. These models were 
then rescaled to change the minimum time step to a large enough value. Since 
the rescaling does not influence the performance results we will present all 
results in the standard iV-body units. 

We realized each of the five models with 11 different particle numbers, iV = 
2 k , k = [10, 11,.. .,20], i.e. N = [1 K, 2 K, 1 M]. 3 We also tested Plummer 
models with N = 2M and N = 4 M; the latter value is the maximum N value 
allowed by filling the memory of all 32 GRAPE cards. Thus, a total of 57 test 
models were used in the timing runs. 

Two-body relaxation, i.e. exchange of energy between particles due to gravi- 



3 Henceforth we use K to denote a factor of 2 10 = 1024 and M to denote a factor 
of 2 20 = 1,048,576. 
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Fig. 4. Lagrange radii for Dehnen models (7 = 1.5) with different N. The results for 
4 096 (dotted), 131072 (dashed) and 1048 576 particles (full line) are shown. This 
is the densest (i.e. most centrally concentrated) of the test models are represents 
appoximately density profile of the Milky Way bulge. Lagrange radii of all other 
tested models were more stable. 



tational scattering, induces a slow change in the characteristics of the models. 
In order to minimize the effects of these changes on the timing runs, we inte- 
grated the models for only one time unit. The softening e was set to zero for 
the Plummer models and to 10~ 4 for the Dehnen and King models. The time 
step parameters were r] s = 0.01 and 77 = 0.02. 



Figs. 3 and 4 show the dependence on time of the total energy, and the La- 
grange radii, for the models. In all models, the maximum relative deviation in 
total energy is of the order of 10~ 5 or less. The Lagrange radii (shown only for 
Dehnen models with 7 = 1.5, the most centrally concentrated of the models 
which we considered) show that the mass profiles of all models remain practi- 
cally unchanged. A noticable, but small, change in the innermost region can 
be seen only for the lowest particle numbers. 



14 




Fig. 5. Wallclock time w versus particle number N for different numbers of proces- 
sors p. The plots in the top row show the results for a Plummer model on the RIT 
(left) and ARI (right) clusters. The remaining plots show wallclock times of Dehnen 
models with 7 = 0.5 (middle left) and 7 = 1.5 (middle right) and of King models 
with W = 9 (bottom left) and W = 12 (bottom right). 

4-2 Performance results 



We analyzed the performance of the hybrid scheme as a function of particle 
number and also as a function of number of nodes; we used p = 1, 2, 4, 8, 16, 
and 32 nodes. The compute time w for a total of almost 350 test runs was 
measured using MPI_Wtime(). The timing was started after all particles had 
finished their initial time step and ended when the model had been evolved 
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Fig. 6. Speedup s versus processor number p for different particle numbers N. The 
plots in the top row show the results for a Plummer model on the RIT (left) and 
ARI (right) clusters. The remaining plots show the speedup for Dehnen models with 
7 = 0.5 (middle left) and 7 = 1.5 (middle right) and for King models with Wo = 9 
(bottom left) and Wq = 12 (bottom right). 

for one time unit. No data output was made during the timing interval. 

Fig. 5 shows wallclock times wn, p from all integrations on the RIT cluster 
as a function of particle number N and processor number p. We also show 
results from just the Plummer models on the ARI cluster. For any p, the clock 
time increases with N, roughly as A^ 2 for large N. However when A" is small, 
communication dominates the total clock time, and w increases with increaing 
number of processors. This behavior changes as A" is increased; for A" > 10 K 
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Fig. 7. Efficiency e vs. number of processors p for different particle numbers TV. The 
plots in the top row show the results for a Plummer model on the RIT (left) and 
ARI (right) clusters. The remaining plots show efficiencies for Dehnen models with 
7 = 0.5 (middle left) and 7 = 1.5 (middle right) and for King models with Wo = 9 
(bottom left) and Wq = 12 (bottom right). 

(the precise value depends on the model), the clock time is found to be a 
decreasing function of p, indicating that the total time is dominated by force 
computations. The clock time is longer for the more centrally concentrated 
models since smaller time steps are required. As expected, the ARI cluster 
is faster than the RIT cluster by about 10% due to its newer hardware and 
better communication. 



The centrally concentrated King and Dehnen models tend to have very small 
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Fig. 8. Speed / vs. particle number N for different numbers of processors p. The 
plots in the top row show the results for a Plummer model on the RIT (left) and 
ARI (right) clusters. The remaining plots show the speed for Dehnen models with 
7 = 0.5 (middle left) and 7 = 1.5 (middle right) and for King models with Wo = 9 
(bottom left) and Wq = 12 (bottom right). 



block sizes at small (N < 10 4 ) particle numbers. If such systems are integrated 
using the larger processor numbers, specific features related to the hardware 
and software implementation of communication (latencies) turn up, which are 
otherwise hidden by the dominating effect of large computation and commu- 
nication, e.g. bandwidth and CPU speed. We will not discuss these effects in 
detail here although their influence can be discerned in the details of Figs. 5-9. 
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Fig. 9. Speed ratio r vs. particle number N for different numbers of processors p. 
The plots in the top row show the results for a Plummer model on the RIT (left) and 
ARI (right) clusters. The remaining plots show the speed ratio for Dehnen models 
with 7 = 0.5 (middle left) and 7 = 1.5 (middle right) and for King models with 
W Q = 9 (bottom left) and W = 12 (bottom right). 

The speedup for selected test runs is shown in Fig. 6. Speedup s is denned as 

Wn ^ ft*\ 

sn, p = • (16) 

w NyP 



The ideal speedup (optimal load distribution, zero communication and la- 
tency) is sn, p = P- For particles numbers iV > 128 K the wallclock time u»jv,i 
on one processor is undefined as iV exceeds the memory of the GRAPE card. 
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In that case we used wn,i = wi28K,i - (-^/128 K) 2 assuming a simple iV 2 -scaling. 
In general, the speedup for any given particle number is roughly proportional 
to p for small p, then reaches a maximum before dropping at large p. The 
number of processors at the the point of maximum speedup is "optimum" in 
the sense that it provides the fastest possible integration of the given problem. 
The optimum p is roughly the value at which the sum of the communication 
and latenc y times equals the for ce computation time; in the zero-latency case, 
p opt oc N ijDorband et alJl2003h . Fig. 6 shows that for N > 128 K, p opt > 32 



for all the tested models. 
Efficiency (Fig. 7) e is defined by 

e N , P = • 17 

p 

Again, the comparison of the Plummer model on the clusters shows that the 
ARI cluster performs better. On 32 nodes the efficiency can be as high as 0.6 
for the highest N. Also, the efficiency does not vary much for models that 
have different central concentrations. 

As mentioned before, the theoretical peak performance of a single GRAPE 
card (or node) is /i )max ~ 131 GFlops. We determined the compute speed / 
from the measured total number of force updates Nf in each run. For each 
force update N forces are calculated, i.e. the compute speed is 

N ■ N f 

f N , P = 57 f -, (18) 



which assumes 57 floating point operations per force calculation. The measured 
compute speed is shown in Fig. 8. The maximum speeds reached are 2.2 TFlops 
on the RIT cluster and 3.2 TFlops on the ARI cluster. 

The speed ratio r is given by 

r N , P = (19) 

P Jl,max 



and shown in Fig. 9. The speed ratio reached a maximum of ~ 0.8 and ~ 
0.9 on the RIT and ARI clusters respectively. This shows again the benefits 
of the newer hardware in the ARI cluster. There are several reasons why 
the theoretical peak speed cannot be reached. Under a realistic time step 
distribution, it is impossible to keep the 48 pipelines on every GRAPE fully 
loaded. Evidence for this is seen in the lower speed ratios of the models with the 
more centrally-concentrated density distributions, like the Dehnen models, in 
which block sizes are typically small. In addition, the communication between 
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the host and the GRAPE requires a non-negligible overhead of order of 10%. 
Communication also detracts from the performance; this can be seen in the 
slightly better performance of the ARI cluster, a result of its faster network. 



5 Performance modeling 



In this section we present a theoretical performance model for the execution 
time of a direct iV-body code on a GRAPE cluster. We first consider the 
performance of a sequential code, which contains the essential elements of the 
performance of the GRAPE hardware, then consider the performance of the 
hybrid parallel scheme described in §3. 



5.1 Performance modeling of the sequential GRAPE code 



If we consider a sequential block time-step code to be used in combination 
with GRAPE-6A, the time required to advance one active particle for one 
integration step can be written as 

T (1) = T host (1) + T grapc (1) + T comm (1) (20) 



where T host (1) = T prod (1) + T corr (1) is the time spent on the host for the pre- 
dictor and corrector operations, T grape (1) = NT pipe is the time spent on the 
GRAPE for the force calculation and T comm (1) is the time spent in communi- 
cation between the host and the G RAPE. The communic ation time between 
GRAPE and host has three terms ( Fukushige et al. 2005| ): 



T comm (l) = 60 ti + 56 1/ + 72 tj 



(21) 



where the first term represents the time to send the predicted positions and 
velocities to the GRAPE, the second term is the time to retrieve acceleration, 
jerk, and potential from the GRAPE, and the third term represents the time 
to send new data to the GRAPE memory for update. Table 1 reports the 
parameters measured on one GRAPE6-A of the RIT cluster. 

The times T pTe ^ for the predictor and T corr for the corrector are measured on the 
host node. The parameter tj is derived by measuring the time T send to send the 
data relative to one particl e to the GRAPE memo ry: tj = T scn d/72. We then 
assume tj, — tf — tj as in iFukushige ei~aT1 (|2005h . The GRAPE parameter 



T pipe is not measured directly but derived from the total time for the force 
calculation by subtracting the time for communication between the host and 
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Table 1 



Performance parameters of one GRAPE-6A board 



Tpred (1) 


^corr (1) 


Tpipe tj 


(1.1 ±0.2) X 1(T 7 


(3.4 ±0.3) x 10~ 7 


(2.2 ±0.5) x 10~ 8 (3.1 ±0.5) x 10" 8 



the GRAPE. This approach is necessary since the measured time Tf orce for the 
force calculation contains both the time for the force computation and the 
communication time between host and GRAPE. In this way T pipc is given by 
Tpip e = (Tf orce — (60 tj + 56 tf)) /N. The total wallclock time for one particle 
step is shown in Fig. 10 for different particle numbers, and compared with 
timing data on one GRAPE-6A. The agreement between the model and the 
data is very good for particle numbers larger than about IK. For smaller N, 
there is a small deviation of the model from the data. The deviation is due to 
a slightly different value of T pipe when the GRAPE memory holds a very small 
number of particles. Given the fact that we are not interested in such small 
particle numbers, we ignore this effect and consider a fixed T pipe throughout 
our analysis. 



0.01 h J 




10 100 1000 10" 10 5 

N 

Fig. 10. Time for one particle step as a function of the number of particles. The 
solid line indicates the theoretical prediction while the data points represent timing 
results on one GRAPE-6A. 

The total time required to advance a block of active particles of size s can 
then be written as 

T (s) = T host (s) + (s) + (s) (22) 

where 

Thost (s) = T prcd (s) + T corr (s) = T host (1) s , 
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T 




Fig. 11. Times, predicted by the theoretical model, spent on the host, the GRAPE 
and in communication as a function of the block size for a sequential GRAPE 
code with block time steps. Data points are actual timing measurements on one 
GRAPE-6A of the RIT cluster. 

T grapc (s)=iVT pipc [s/48] , 

Tcomm (s) = 60 Us + 56 tfS + 72 tjS = 188 tjS , (23) 

with [x/y] = (int)(x/y) + 1. Since the GRAPE pipeline accepts a maximum of 
48 particles, T grape is a step function with a step in correspondence of multiples 
of 48 in block size. 

Fig. 11 shows a comparison between the predicted times (solid lines) for given 
block sizes s and timing measurements (data points) conducted on the RIT 
cluster. The different plots refer to Plummer models with A^ = 8 K, 16 K, 64 K, 128 K. 
Given the errors on the timing measurements and on the parameters reported 
in Table 1, there is good agreement between the data and the performance 
model. For the A^ = 8 K model, the time spent in communication between 
the host and the GRAPE is larger then the time spent on the GRAPE itself 
for the force calculation. For A" > 16 K, the time for the force calculation 
becomes larger then the communication time and for the A" = 128 K system 
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the total execution time is dominated by the force calculation on the GRAPE. 
This shows that the use of GRAPE hardware for A-body simulations is most 
efficient in the case of large systems, which are the most interesting from the 
scientific point of view. The non-linear increase in the host time for block sizes 
larger than about 1000 is likely due to cache misses. 

The prediction and the measurements are independent of the chosen A-body 
model as long as the execution time is expressed as a function of the block 
size. In order to predict the execution time for the integration of a system 
over one A-body unit (or any other physical time), it is necessary to know 
the block size distribution for the model under consideration. In the case of a 
Plummer model with given A, the total execution time over one A-body time 
unit can be estimated by considering the average value of the block size (s) 
and the total number of integration steps n steps in one A-body unit, 

T N = T N ((s)) 

^steps • 

(24) 



We have measured the average block size and the number of steps in one 
A-body unit for Plummer models of different A and applied Eq. 24 to the 
prediction of the total execution times for the same models. Fig. 12 shows a 
comparison betweeen the predicted execution time for the integration of a 
Plummer model over one A-body unit and timing measurements on a single 
GRAPE-6A. The model satisfactorily predicts the time spent on the host, on 
the GRAPE and in communication for particle numbers A > 2 K. For smaller 
A, deviations from the prescription given by Eq. 24 are more likely to occur 
and to affect the modeling. In particular, the block size is generally small and 
the average is generally not a good representation of the global behavior of 
the system. 



5.2 Performance modeling of the parallel GRAPE code 

In the case of the hybrid scheme, the total execution time can be written as 
T = Thost + Tgrapc + T comm + Tmpi (25 



where Tmpi indicates the time spent in communication among the nodes. 
If s is the block size at a specific step during the integration and s max = 
maxj =1 p {sj} is the maximum of the local blocks on the different nodes, the 
time spent on the host is given by 

Tiost Tp rec [ (s max ) T corr (s max ) Tp rcc [ (1) S max ~\~ T corT (1) S max , (^6) 



24 



10 4 B-rrj 




0.01 ^ 1 — 1 — 1 ^ 

1000 10 4 10 5 

N 

Fig. 12. Comparison between the predictions by the theoretical model and the timing 
measurements for the integration of Plummer models over one iV-body unit. The 
long-dashed line indicates the time spent on the GRAPE for the force calculation, 
the dotted line indicates the time spent in communication between the host and 
the GRAPE, the dashed line indicates the time spent on the host and the solid 
line represents the total execution time given by the sum of the three separate 
times. The data points indicate the results of the timing experiments conducted on 
a GRAPE-6A of the RIT cluster. 

the time spent on the GRAPE is given by 
N 

T grap e = -T pipc [ S /48] , (27) 

the time spent in communication between the host and the GRAPE is 

T comm = 60 t t s + 56 t f s + 72 1 3 s max , (28) 

and the time spent in communication among the nodes is given by the sum of 
the time spent in each MPI call. The time Tmpi is dominated by two calls to 
the function MPI_Allreduce and three calls to the function MPI_Allgatherv. 
We adopt the following models for the MPI functions: 

^MPUllgatherv = (a + (3 x) log 2 p, 
^MPI_Allreduce = 

{5 + 1 x)log 2 p (29) 

where x represents the size of transferred data measured in bytes and a, (3, S, 
7 are parameters obtained by fitting timing measurements on the RIT cluster 
(see Table 2). 
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Table 2 

Fit parameters for modeling of the MPI functions. 



a [sec] 


[sec] 


5 [sec] 


7 [sec] 


1.2 x i(r 5 


2.5 x 1(T 9 


1.0 x 10" 5 


1.0 x 10~ 8 



Fig. 13 reports the comparison between the prediction for the total execution 
time as a function of the block size at one specific step and the timing results on 
the RIT cluster. As in the case of the theoretical model, the times spent on the 
host, the GRAPE, in communication with the GRAPE and in communication 
among the nodes are measured separately and then added together. 
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Fig. 13. Comparison between the execution time predicted by the model for the 
integration of a Plummer model with iV =65536 (left) and N = 131072 (right) and 
timing experiments on the RIT cluster. The total execution time is plotted as a 
function of the block size. 

In order to estimate the total execution time of the parallel scheme over one 
iV-body unit, we consider the average value of the block size (s) and the total 
number of integration steps n st e P s in one iV-body unit for Plummer models 
with different particle numbers. Similarly to the case of the sequential code, 
we can approximate the total execution time for a particle number N and 
processor number p as 



Tn, p — Tn, p ({s)) n stcps . 



(30) 



Fig. 14 shows a comparison betweeen the predicted execution time for the 
integration of different Plummer models over one iV-body unit and timing 
measurements. For the theoretical prediction we have used Eq. 30 and mea- 
sured values for the average block size and the number of steps in one iV-body 
unit. The agreement between the model and the data is good for large particle 
numbers while deviations appear for small N. 
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Fig. 14. Comparison between the predictions by the theoretical model (solid lines) 
and the timing measurements (data points) for the integration of Plummer models 
over one iV-body unit. The different lines refer to different processor numbers. 

6 Discussion 

The performance of a direct-summation, parallel iV-body code has been eval- 
uated on two new computer clusters incorporating GRAPE special-purpose 
accelerator boards. Parallelization was carried out using a hybrid method in- 
corporating aspects of both systolic ("ring") and broadcast ("copy") algo- 
rithms, in order to make the most efficient use of the GRAPE pipelines. The 
algorithm exhibits asymptotically an O(Np) scaling in communication com- 
plexity and an 0(N 2 /p) scaling in computation time. Benchmark simulations 
were carried out using a set of galaxy models having realistically high degrees 
of central concentration. Using one million particles and 32 nodes, the clus- 
ters achieved a sustained performance of 50% - 100% of the theoretical peak 
(~ 4 TFlops). When run on general-purpose parallel computers, iV-body codes 
like ours typically only achieve a few per cent of peak performance; hence, our 
special-purpose computer clusters are competitive with the fastest computers 
in the world when applied to the gravitational iV-body problem. 

We presented a simple model that predicts the performance of the iV-body 
code as a function of processor number p, particle number N, and hardware 
constants. The model reproduces the observed performance very well, and 
can be used to predict the performance of the code on clusters with different 
hardware characteristics. The model supports our finding that the perfor- 
mance depends critically on having very fast and low-latency communication 
hardware. 
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It is convenient to discuss the performance of the parallel iV-body code in 
terms of a working point P defined by a pair of values P = (p, N), where P 
is given by the condition that the time required for communication and com- 
putation be approximately equal. When using all 32 of the processors on our 
clusters, this point is reached for iV 10 6 . For larger N, the system operates 
at or near optimal speedup and efficiency (see Figs. 6 and 7). At a given p, the 
linear scaling of total computation time with N at low N (communication- 
dominated) changes to the asymptotic, N 2 scaling (computation dominated) 
roughly at P. Increasing N beyond its value at P would still achieve near- 
optimal speedups, but there would be room to increase p without sacrificing 
efficiency (if more nodes were available). Increasing p beyond P at fixed N 
is not useful because communication overhead will start to dominate, leaving 
the GRAPEs idle for part of the time. The design goal of our clusters was 
to simulate ~ 10 6 particles at high efficiencies, and we have shown by our 
performance tests that this goal has been achieved. 



o 

-+-> 

o 



X 

Z5 



-t-> 
Cl 



X 



ID 

d 



o 




000 



N 

Fig. 15. This figure demonstrates what particle numbers are needed to achieve the 
separation of time scales discussed in the text, when simulating nuclei containing a 
central sink, e.g. a single or binary BH. Vertical axis is the fraction of the flux into 
the "sink" originating from orbits that are in the empty-loss-cone regime; in real 
galaxies, this ratio is close to unity. Curves are labelled by rLc/ r h, where tlc is 
the linear size of the central "sink" and is the BH influence radius. For a binary 
SBH, 0.1 < r LC /r h < 10" 3 . 



How large an N is required to accurately represent the evolution of a dense 
stellar system like a galactic nucleus? A minimum condition is that N be large 
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enough to enforce a strict separation of time scales between the period T OTb 
of an orbit and the relaxation time T re f] the l atter is the ti me for two-body 
gravitational scattering to randomize velocities Js^l^ . In real galaxies, 



T re i 3> Tor},, i.e. the integrity of stellar orbits is maintained for many orbital 
periods. In an iV-body simulation, one has 

rp ^O.liV 

■J-rel ~ , Ar -'orb K^^J 

InJV 



( Aarsethlliffll . and the separation of time scales can be achieved with modest 



particle numbers, N > 10 3 . Relaxation-driven processes like core collapse can 
be simulated using iV-body codes and the numer ical evolution rates scaled to 
those in real gala xies using an equation like (31) ([Spurzem and Aarseth 19961: 
Szell et aUbOOfiji . 



In simulations of galactic nuclei, a much stronger separation of time scales 
must be achieved if the results are to be scaled to real systems, implying 
larger values of N. Galactic nuclei contain "sinks," regions near the center 
where stars are lost or captured. For instance, the supermassive black hole 
(BH) at the center of the Milky Way galaxy disrupts stars that pass within 
a distance rudai ~ 10~ 6 pc of it. Another example is a galaxy containing a 
binary supermassive BH: stars that pass within a distance ~ a of either BH 
are ejected by the gravitational slingshot, where 1CT 3 pc < a < 1CT 1 pc is 
the semi-major axis of the binary. In this case, the radius of the "sink" is 
~ a ^> r tida i. Single or binary supermassive BHs are believed to be generic 
comp onents of galaxies ( Ferrarese and Ford 20051 : Merritt and Milosavlievic 



2005), and the structure and evolution of nuclei may be sh own to depen d 



critically on the rate at which stars are lost to the central sink ( Merritt 2006). 



In real galaxies, relaxation times are long enough that most stars would diffuse 
gradually - i.e., over many orbital periods - onto so-called loss-cone orbits that 
intersect the capture sphere of radius rudai or a. This means that the loss cone 
is essentially empty, since the removal time is much shorter than the diffusion 
time. In order to reproduce a diffusive loss-cone repopulation in an A^-body 
simulation, the relaxation time must be large enough that 



T re i > T orb 3> T orb (32) 

\r LC ; 



(Li ghtman and Shaoirol fl977h . Here, r LC is the radius of the disruption 



rtidai) or ejection (~ a) sphere at the center of the galaxy, and r is the typical 
size of a stellar orbit. Equation (32) reflects the fact that scattering into a loss- 
cone orbit occurs in much less than one relaxation time, hence maintaining 
an empty loss cone requires that relaxation times be much longer than orbital 
periods. This is a stricter condition than T re i 3> T or b and implies a larger N. 
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Fig. 15 presents a more careful analysis along these lines. Shown there is 
the fraction of the flux into the central sink originating from orbits that are 
in the empty loss cone regime, as a function of iV and r^c] the latter is ex- 
pressed in units of r^, the radius of gravitational influence of the central (single 
or binary) BH of mass M,. (The influence radius is the radius containing a 
mass in stars equal to twice M..) In the case of a binary supermassive BH, 
^Lc/ r h ~ a/^h is roughly lCT 1 at the time of binary formation, falling to 
~ when the two BHs are close enough to coalesce. Fig. 15 suggests that 
particle numbers accessible to a single GRAPE-6 (~ 0.25M) can only repro- 
duce the empty loss cone regime characteristic of real galaxies during the early 
phases of binary evolution. The values of N achievable on gravitySimulator 
(~ AM) allow the evolution of a binary to be followed to separations ~ 1/10 
as small. These estimates are consistent with the results of recent simulations 
which reproduce diffusion-limited evolution of massive binaries with N m 10 6 
(Making and Funatoll2004 : Berczik et al . 2005). Simulating loss cone dynamics 



around the much smaller tidal disruption sphere of a single BH (ric ~ 10 _6 r^) 
would require considerably larger N, well beyond the capabilities of existing 
or planned computers. However the dynamics in this regime could be quali- 
tatively reproduced by adjusting the size of the capture sphere in an iV-body 
simulation such that the bulk of the stars scattered into the BH are on orbits 
respecting the correct ratio of T re i to T or b. 

The T re i ~ iV scaling of Eq. (31) implies an effectively ~ iV 3 scaling for cal- 
culations that extend over one relaxation time of the system. This scaling 
makes it very expensive to simulate the "collisional" evolution of stellar sys- 
tems using the full, 4 x 10 6 particle number allowed by the combined GRAPE 
memories on the RIT and ARI clusters. One way to accelerate the computa- 
tions without significant loss of accuracy would b e to use the Ahmad-Cohen 
(AC) neighbor scheme ( Ahmad and Cohen 19731 ). as implemented in codes 



like NBODY5 or NBODY6 (lAarsethlllQQQi Tln the AC scheme, the full forces 



are computed only every tenth timestep or so; in the smaller intervals, the 
forces from nearby particles (the "irregular" force) are updated using a high 
order scheme, while those from the more distant ones (the "regular" force) are 
extrapolated using a second-degree polynomial. A parallel implementation of 
NBODY6, inclu ding the AC scheme, exists, but on ly for general-purpose par- 
allel computers ( Spurzem 19991 : Khalisi et al. 2003^ 1 ; the algorithm has not yet 



been adapted to systems with GRAPE hardware. 

Greater speedups could also be achieved by increasing the number p of proces- 
sors, but only if communication costs are held low. One way to do this is via 



a variant of the lLippert et al.l (|1998a.b) hyper-systolic (HS) algorithm. In its 
basic form, the HS algorithm reduces the number of data transfers by storing 
the shifted data on each node. The number of shifts, and hence the communi- 
cation time, is reduced from 0(p) to O(^fp) and the memory requirements are 
increased by a similar factor. The HS speedup is smaller when only a subset of 
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the fu l l N part i cles a re advanced at every time step however ([Dorband et al. 



2003h . iMakinol {2002) has presented a direct A-body summation code opti- 



mized for a quadratic layout of processor (p required to be a square number), 
which is in fac t a simp lified version of the hypersystolic algorithm proposed by 
Lippert et all (|l998bh . In Makino's case the asymptotic scaling of communica- 
tion is 0(N/ y/p) (for calculation cost there is no difference to our code), which 
would allow to use much larger numbers of processors. However, performance 
tests of a HS iV-body algorithm on actual hardware have apparently not been 
carried out. 

Implementation of the AC and/or HS schemes should permit effective use of 
the RIT and ARI clusters with the full particle numbers permitted by the 
combined GRAPE memories. Still larger N could be attained by implement- 
i ng a hybrid scheme, e.g. cou pling a direct-summation algorit hm with a tree 



mg a nybria scneme, e.g. cou pling a direct-summation algorit nm witn a tree 
( McMillan and Aarsethlll993| ) or basis- function representation ( Hemsendorf et al 



20021 ) for the distant particles, or a hierarchical generalization of the Ahmad- 



Cohen neighbor scheme. 
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